# Fake News Detection - Setup & Run Guide

## Prerequisites

- Python 3.9 or higher
- pip (Python package manager)
- 4GB+ RAM recommended

---

## Step 1: Clone or Download the Project

**Option A: Using Git**

    git clone https://github.com/yourusername/fake-news-detection.git
    cd fake-news-detection

**Option B:** Download the ZIP and extract it.

---

## Step 2: Create a Virtual Environment (Recommended)

**Windows:**

    python -m venv .venv
    .venv\Scripts\activate

**macOS/Linux:**

    python3 -m venv .venv
    source .venv/bin/activate
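
Before installing dependencies, it can help to confirm that the environment is active and that the interpreter satisfies the Python 3.9 prerequisite. The following is a minimal sketch using only the standard library; the function names `in_virtualenv` and `meets_minimum` are illustrative and not part of this project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def meets_minimum(version=None, minimum=(3, 9)) -> bool:
    """Return True when `version` (default: running interpreter) is >= minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("Python 3.9+:", meets_minimum())
```

If the first line prints `False`, re-run the activation command for your platform before continuing to Step 3.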

---

## Step 3: Install Dependencies

**Option A: ML Models Only (Logistic Regression & SVM)**

    pip install -r requirements-ml.txt

**Option B: ML + BERT Models**

    pip install -r requirements-bert.txt

---

## Step 4: Prepare Your Dataset

### Dataset Format

| Column  | Required | Description                                  |
|---------|----------|----------------------------------------------|
| text    | Yes      | News article content                         |
| label   | Yes      | FAKE/REAL or 0/1 (1 = REAL, 0 = FAKE)        |
| title   | Optional | Article headline (will be merged with text)  |
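
A quick header check can catch a mismatched dataset before the cleaning step runs. The sketch below is illustrative only: it assumes the column names match the table exactly (`text`, `label`), and `missing_columns` is a hypothetical helper, not a function from this repository.

```python
import csv
import io

# Required column names, taken from the table above.
REQUIRED_COLUMNS = {"text", "label"}


def missing_columns(csv_file) -> set:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = {name.strip() for name in (reader.fieldnames or [])}
    return REQUIRED_COLUMNS - header


# In-memory sample; for a real file, pass
# open("data/raw.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("title,text,label\nHeadline,Body of the article,REAL\n")
print(missing_columns(sample))  # prints set()
```

If your file uses different column names, the `--text-col`, `--label-col`, and `--title-col` flags in Step 5 are the intended way to point the tool at them.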

### Download Dataset

Popular datasets from Kaggle:
- Fake and Real News Dataset: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
- ISOT Fake News Dataset: https://www.kaggle.com/datasets/csmalarkodi/isot-fake-news-dataset
- Fake News Competition: https://www.kaggle.com/c/fake-news/data

### Place Dataset

Save your dataset as: `data/raw.csv`

---

## Step 5: Clean & Prepare Data

    python -m src.prepare_data --input data/raw.csv --output data/cleaned.csv

**Optional flags:**

| Flag              | Description                    |
|-------------------|--------------------------------|
| --text-col NAME   | Specify text column name       |
| --label-col NAME  | Specify label column name      |
| --title-col NAME  | Specify title column name      |
| --dropna          | Drop rows with missing values  |

**Expected output:**

    Saved cleaned dataset: data/cleaned.csv  (rows=XXXX)
    Label meaning: 1=REAL, 0=FAKE
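
To make the cleaning step more concrete, here is a rough sketch of the kind of normalization it describes: mapping FAKE/REAL or 0/1 labels to integers and merging the title into the text. This is not the code in `src/prepare_data.py`; the helper names, the single-space separator between title and text, and the exact `dropna` behavior are all assumptions for illustration.

```python
from typing import Optional

# Assumed mapping, following "Label meaning: 1=REAL, 0=FAKE".
LABEL_MAP = {"real": 1, "fake": 0, "1": 1, "0": 0}


def normalize_label(raw) -> Optional[int]:
    """Map FAKE/REAL or 0/1 (any case, padded) to 1 or 0; None if unrecognized."""
    return LABEL_MAP.get(str(raw).strip().lower())


def merge_title(title: Optional[str], text: Optional[str]) -> str:
    """Prepend the headline to the body when a title is present."""
    title = (title or "").strip()
    text = (text or "").strip()
    return f"{title} {text}".strip() if title else text


def clean_rows(rows, dropna=False):
    """Yield {'text', 'label'} dicts; with dropna, skip incomplete rows."""
    for row in rows:
        label = normalize_label(row.get("label", ""))
        text = merge_title(row.get("title"), row.get("text"))
        if dropna and (label is None or not text):
            continue
        yield {"text": text, "label": label}


rows = [
    {"title": "Breaking", "text": "Something happened", "label": "REAL"},
    {"text": "", "label": "FAKE"},  # empty text, dropped when dropna=True
]
print(list(clean_rows(rows, dropna=True)))
# prints [{'text': 'Breaking Something happened', 'label': 1}]
```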

---

## Step 6: Train Models

### Train Logistic Regression

    python -m src.train_tfidf --data data/cleaned.csv --model lr

### Train SVM

    python -m src.train_tfidf --data data/cleaned.csv --model svm

### Train DistilBERT (Optional - GPU Recommended)

    python -m src.train_bert --data data/cleaned.csv --out models/bert_distilbert

**Training options:**

| Option                 | Description                              |
|------------------------|------------------------------------------|
| --test-size 0.2        | Test split ratio (default: 0.2)          |
| --max-features 40000   | TF-IDF vocabulary size (default: 40000)  |
| --ngram-max 2          | Largest n-gram size (uses 1 to N words)  |
| --seed 42              | Random seed                              |
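
The `--ngram-max` and `--max-features` options are easier to reason about with a small example. The sketch below is a simplified, standard-library illustration of what an n-gram range and a vocabulary cap mean; it is not the project's training code, and the whitespace tokenization and the helper names are assumptions.

```python
from collections import Counter


def word_ngrams(text: str, ngram_max: int = 2):
    """Return all word n-grams of length 1..ngram_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, ngram_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(documents, ngram_max: int = 2, max_features: int = 40000):
    """Keep only the `max_features` most frequent n-grams across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(word_ngrams(doc, ngram_max))
    return [gram for gram, _ in counts.most_common(max_features)]


print(word_ngrams("Markets rally today"))
# prints ['markets', 'rally', 'today', 'markets rally', 'rally today']
```

A smaller `max_features` value shrinks the feature matrix, which is why lowering it tends to speed up training at some cost in coverage of rare terms.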

**Models saved to:**

- models/pipeline_lr.joblib
- models/pipeline_svm.joblib
- models/bert_distilbert/

---

## Step 7: Run the Web Application

    streamlit run app/streamlit_app.py

**Access the app:**
- Local: http://localhost:8501
- Network: http://YOUR_IP:8501

---

## Step 8: Using the App

### Single Text Prediction
1. Select a model from the sidebar (LR / SVM / BERT)
2. Paste a news article into the text area
3. Click **Predict**
4. View the result: REAL or FAKE, with a confidence score

### Batch CSV Prediction
1. Upload a CSV file with a text column
2. Click **Run batch prediction**
3. View the results in the table
4. Download the predictions as a CSV file
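
The batch flow above amounts to reading a CSV, predicting per row, and writing the rows back with an extra column. The following sketch shows that shape outside the web UI. It is an assumption-laden illustration: `placeholder_predict` is a hypothetical stand-in for a trained model, and the `prediction` column name is invented here and may differ from what the app writes.

```python
import csv
import io


def placeholder_predict(text: str) -> int:
    """Hypothetical stand-in for a trained model: 1 (REAL) or 0 (FAKE)."""
    return 0 if "!!!" in text else 1


def batch_predict(csv_in, csv_out, predict=placeholder_predict, text_col="text"):
    """Copy rows from csv_in to csv_out with an added 'prediction' column."""
    reader = csv.DictReader(csv_in)
    if text_col not in (reader.fieldnames or []):
        raise ValueError(f"CSV has no '{text_col}' column")
    fieldnames = list(reader.fieldnames) + ["prediction"]
    writer = csv.DictWriter(csv_out, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        # Translate the numeric label back to its name (1=REAL, 0=FAKE).
        row["prediction"] = "REAL" if predict(row[text_col]) == 1 else "FAKE"
        writer.writerow(row)
        count += 1
    return count


source = io.StringIO("text\nCouncil approves budget\nYou won't believe this!!!\n")
result = io.StringIO()
print(batch_predict(source, result), "rows written")
```

Passing the predictor as a parameter keeps the CSV handling independent of which model (LR, SVM, or BERT) produces the label.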

---

## Quick Command Reference

Full pipeline (copy-paste ready):

    pip install -r requirements-ml.txt
    python -m src.prepare_data --input data/raw.csv --output data/cleaned.csv
    python -m src.train_tfidf --data data/cleaned.csv --model lr
    python -m src.train_tfidf --data data/cleaned.csv --model svm
    streamlit run app/streamlit_app.py

---

## CLI Prediction (Without Web UI)

    python -m src.cli_predict --model-path models/pipeline_lr.joblib --text "Your news article here"
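
Article text often contains quotes and apostrophes, which makes shell quoting error-prone. One way around this is to call the CLI from Python with an argument list, so no shell quoting is involved. The sketch below only builds the command using the flags shown above; `build_predict_command` is a hypothetical helper, and running the command still requires a trained model and must be done from the project root.

```python
import shlex
import sys


def build_predict_command(text, model_path="models/pipeline_lr.joblib"):
    """Return the argument list for the CLI prediction command shown above."""
    return [
        sys.executable, "-m", "src.cli_predict",
        "--model-path", model_path,
        "--text", text,
    ]


cmd = build_predict_command('Senate passes "historic" bill')
# shlex.join shows a correctly quoted equivalent for pasting into a POSIX shell.
print(shlex.join(cmd))

# To execute it (requires a trained model at model_path):
# import subprocess
# subprocess.run(cmd, check=True)
```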

---

## Troubleshooting

| Issue                    | Solution                                                           |
|--------------------------|--------------------------------------------------------------------|
| ModuleNotFoundError      | Run pip install -r requirements-ml.txt (or requirements-bert.txt)  |
| Model not found          | Train the model first (Step 6)                                     |
| Could not infer columns  | Use the --text-col and --label-col flags (Step 5)                  |
| Port 8501 in use         | Use streamlit run app/streamlit_app.py --server.port 8502          |
| BERT out of memory       | Reduce batch size: --batch 4                                       |
---

## Project Structure

    fake-news-detection/
    │
    ├── app/
    │   └── streamlit_app.py      # Web UI
    │
    ├── data/
    │   ├── raw.csv               # Your dataset
    │   ├── cleaned.csv           # Processed dataset
    │   └── sample_input.csv      # Example format
    │
    ├── models/
    │   ├── pipeline_lr.joblib    # Trained LR model
    │   ├── pipeline_svm.joblib   # Trained SVM model
    │   └── bert_distilbert/      # Trained BERT model
    │
    ├── src/
    │   ├── prepare_data.py       # Data preprocessing
    │   ├── train_tfidf.py        # ML model training
    │   ├── train_bert.py         # BERT training
    │   ├── predict_tfidf.py      # ML inference
    │   ├── predict_bert.py       # BERT inference
    │   ├── cli_predict.py        # Command-line prediction
    │   └── utils.py              # Helper functions
    │
    ├── requirements-ml.txt       # ML dependencies
    ├── requirements-bert.txt     # BERT dependencies
    └── README.md

---

## Performance Tips

1. For faster training: use --max-features 20000 instead of the default 40000
2. For better accuracy: use a larger dataset (10,000+ samples)
3. For BERT: use a CUDA-capable GPU; training can be around 10x faster than on CPU
4. For production: use the SVM model (a good balance of speed and accuracy)

---

## Label Reference

| Label | Meaning               |
|-------|-----------------------|
| 1     | REAL (Genuine news)   |
| 0     | FAKE (Misinformation) |

---

Happy detecting!
